VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency
Liu, Hongcheng, Hou, Yixuan, Liu, Heyang, Wang, Yuhao, Wang, Yanfeng, Wang, Yu
While Speech Large Language Models (Speech-LLMs) show strong performance in many applications, their robustness is critically under-tested, especially to speech disfluency. Existing evaluations often rely on idealized inputs, overlooking common disfluencies, particularly those associated with conditions like Parkinson's disease. This work investigates whether current Speech-LLMs can maintain performance when interacting with users who have speech impairments. To facilitate this inquiry, we introduce VocalBench-DF, a framework for the systematic evaluation of disfluency across a multi-dimensional taxonomy. Our evaluation of 22 mainstream Speech-LLMs reveals substantial performance degradation, indicating that their real-world readiness is limited. Further analysis identifies phoneme-level processing and long-context modeling as primary bottlenecks responsible for these failures. Strengthening the recognition and reasoning capabilities of individual components and pipelines can substantially improve robustness. These findings highlight the urgent need for new methods to improve disfluency handling and build truly inclusive Speech-LLMs.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Europe > France (0.05)
- North America > Canada (0.04)
- (30 more...)
- Information Technology (0.68)
- Government > Regional Government > North America Government > United States Government (0.46)
- Health & Medicine > Therapeutic Area > Neurology > Parkinson's Disease (0.34)
- Health & Medicine > Therapeutic Area > Musculoskeletal (0.34)
A Good Plan is Hard to Find: Aligning Models with Preferences is Misaligned with What Helps Users
Balepur, Nishant, Shu, Matthew, Sung, Yoo Yeon, Goldfarb-Tarrant, Seraphina, Feng, Shi, Yang, Fumeng, Rudinger, Rachel, Boyd-Graber, Jordan Lee
To assist users in complex tasks, LLMs generate plans: step-by-step instructions towards a goal. While alignment methods aim to ensure LLM plans are helpful, they train (RLHF) or evaluate (ChatbotArena) on what users prefer, assuming this reflects what helps them. We test this with Planorama: an interface where 126 users answer 300 multi-step questions with LLM plans. We get 4388 plan executions and 5584 comparisons to measure plan helpfulness (QA success) and user preferences on plans, and recreate the setup in agents and reward models to see if they simulate or prefer what helps users. We expose: 1) user/model preferences and agent success do not accurately predict which plans help users, so common alignment feedback can misalign with helpfulness; 2) this gap is not due to user-specific preferences, as users are similarly successful when using plans they prefer/disprefer; 3) surface-level cues like brevity and question similarity strongly link to preferences, but such biases fail to predict helpfulness. In all, we argue aligning helpful LLMs needs feedback from real user interactions, not just preferences of what looks helpful, so we discuss the plan NLP researchers can execute to solve this problem.
- Asia > Middle East > Jordan (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- (12 more...)
- Education (1.00)
- Leisure & Entertainment (0.92)
- Information Technology (0.67)
- Government (0.67)
Trump latest: Migration crackdown, DeepSeek's rise, what's ahead on Tuesday
United States President Donald Trump signed a series of executive orders on Monday aimed at reshaping military policies, including the removal of diversity, equity and inclusion programmes (DEI), reinstating service members discharged for refusing COVID-19 vaccines, and barring transgender people from military service. Earlier in the day, newly confirmed Secretary of Defense Pete Hegseth, who secured the position after a narrow Senate vote, said he would ensure the orders "are complied with rapidly and quickly". Here is the latest news from Monday and a look ahead for the week. Speaking with reporters on board Air Force One on Monday, Trump said that he signed four executive orders. Among those, Trump revealed he signed an order to establish a framework for developing what his administration calls an "American Iron Dome," a missile defence system designed to protect the homeland.
- Asia > India (0.30)
- North America > Mexico (0.15)
- Asia > China (0.07)
- (7 more...)
- Health & Medicine > Therapeutic Area (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
- Government > Military (1.00)
Trump to declare national emergency at border in flurry of day one orders
In a series of calls with reporters on Monday morning, incoming Trump administration officials outlined dozens of executive orders the president-elect planned to take when he officially takes office, including 10 focused on what one official described as "common sense immigration policy". Officials said that Trump plans to end birthright citizenship, meaning that the children of undocumented migrants living in the US will no longer automatically be considered US citizens. Birthright citizenship, however, is enshrined in the US constitution and would require a two-thirds vote in both chambers of Congress to change. The official provided no further detail on how Trump plans to accomplish this. As part of the national emergency designation at the border, Trump will also direct the Department of Defense to "seal the border" and surge additional resources and personnel, including counter-drone capabilities.
- North America > Mexico (0.39)
- North America > United States > Alaska > Denali Borough > Mt Mckinley (0.07)
- North America > Canada (0.07)
- (2 more...)
Quantifying and Mitigating Unimodal Biases in Multimodal Large Language Models: A Causal Perspective
Chen, Meiqi, Cao, Yixin, Zhang, Yan, Lu, Chaochao
Recent advancements in Large Language Models (LLMs) have facilitated the development of Multimodal LLMs (MLLMs). Despite their impressive capabilities, MLLMs often suffer from an over-reliance on unimodal biases (e.g., language bias and vision bias), leading to incorrect answers in complex multimodal tasks. To investigate this issue, we propose a causal framework to interpret the biases in Visual Question Answering (VQA) problems. Within our framework, we devise a causal graph to elucidate the predictions of MLLMs on VQA problems, and assess the causal effect of biases through an in-depth causal analysis. Motivated by the causal graph, we introduce a novel MORE dataset, consisting of 12,000 VQA instances. This dataset is designed to challenge MLLMs' abilities, necessitating multi-hop reasoning and the surmounting of unimodal biases. Furthermore, we propose two strategies to mitigate unimodal biases and enhance MLLMs' reasoning capabilities, including a Decompose-Verify-Answer (DeVA) framework for limited-access MLLMs and the refinement of open-source MLLMs through fine-tuning. Extensive quantitative and qualitative experiments offer valuable insights for future research. Our project page is at https://opencausalab.github.io/MORE.
- Africa > South Africa (0.04)
- North America > United States > Alaska > Denali Borough > Mt Mckinley (0.04)
- Europe > France (0.04)
- (15 more...)
- Transportation > Ground (0.96)
- Automobiles & Trucks > Manufacturer (0.73)
- Leisure & Entertainment > Sports > Soccer (0.69)
Knowledge Graph Enhanced Large Language Model Editing
Zhang, Mengqi, Ye, Xiaotian, Liu, Qiang, Ren, Pengjie, Wu, Shu, Chen, Zhumin
Large language models (LLMs) are pivotal in advancing natural language processing (NLP) tasks, yet their efficacy is hampered by inaccuracies and outdated knowledge. Model editing emerges as a promising solution to address these challenges. However, existing editing methods struggle to track and incorporate changes in knowledge associated with edits, which limits the generalization ability of post-edit LLMs in processing edited knowledge. To tackle these problems, we propose a novel model editing method that leverages knowledge graphs for enhancing LLM editing, namely GLAME. Specifically, we first utilize a knowledge graph augmentation module to uncover associated knowledge that has changed due to editing, obtaining its internal representations within LLMs. This approach allows knowledge alterations within LLMs to be reflected through an external graph structure. Subsequently, we design a graph-based knowledge edit module to integrate structured knowledge into the model editing. This ensures that the updated parameters reflect not only the modifications of the edited knowledge but also the changes in other associated knowledge resulting from the editing process. Comprehensive experiments conducted on GPT-J and GPT-2 XL demonstrate that GLAME significantly improves the generalization capabilities of post-edit LLMs in employing edited knowledge.
- Europe > Sweden (0.05)
- Africa (0.05)
- North America > United States > Alaska > Denali Borough > Mt Mckinley (0.04)
- (2 more...)
Symbol tuning improves in-context learning in language models
Wei, Jerry, Hou, Le, Lampinen, Andrew, Chen, Xiangning, Huang, Da, Tay, Yi, Chen, Xinyun, Lu, Yifeng, Zhou, Denny, Ma, Tengyu, Le, Quoc V.
We present symbol tuning - finetuning language models on in-context input-label pairs where natural language labels (e.g., "positive/negative sentiment") are replaced with arbitrary symbols (e.g., "foo/bar"). Symbol tuning leverages the intuition that when a model cannot use instructions or natural language labels to figure out a task, it must instead do so by learning the input-label mappings. We experiment with symbol tuning across Flan-PaLM models up to 540B parameters and observe benefits across various settings. First, symbol tuning boosts performance on unseen in-context learning tasks and is much more robust to underspecified prompts, such as those without instructions or without natural language labels. Second, symbol-tuned models are much stronger at algorithmic reasoning tasks, with up to 18.2% better performance on the List Functions benchmark and up to 15.3% better performance on the Simple Turing Concepts benchmark. Finally, symbol-tuned models show large improvements in following flipped-labels presented in-context, meaning that they are more capable of using in-context information to override prior semantic knowledge.
- Europe > United Kingdom (0.27)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > Switzerland > Zürich > Zürich (0.14)
- (50 more...)
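The label-substitution step at the heart of symbol tuning can be sketched in a few lines. This is a hypothetical illustration of the data transformation only, not the authors' training pipeline; the "foo"/"bar" mapping follows the example given in the abstract.

```python
# Sketch of the symbol-tuning data transformation: natural-language labels
# in in-context input-label pairs are replaced with arbitrary symbols, so a
# model cannot rely on label semantics and must learn the mapping itself.

def symbol_tune_examples(examples, symbol_map):
    """Replace each natural-language label with its assigned arbitrary symbol."""
    return [(text, symbol_map[label]) for text, label in examples]

examples = [
    ("The movie was wonderful.", "positive"),
    ("I hated every minute.", "negative"),
]
symbol_map = {"positive": "foo", "negative": "bar"}

tuned = symbol_tune_examples(examples, symbol_map)
# tuned pairs now carry semantics-free labels, e.g.
# ("The movie was wonderful.", "foo")
```

Finetuning on many tasks transformed this way is what the paper reports as improving robustness to underspecified prompts and flipped labels.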
Generating Data for Symbolic Language with Large Language Models
Ye, Jiacheng, Li, Chengzu, Kong, Lingpeng, Yu, Tao
While large language models (LLMs) bring not only performance but also complexity, recent work has started to turn LLMs into data generators rather than task inferencers, where another affordable task model is trained for efficient deployment and inference. However, such an approach has primarily been applied to natural language tasks and has not yet been explored for symbolic language tasks with complex structured outputs (e.g., semantic parsing and code generation). In this paper, we propose SymGen which utilizes LLMs for generating various annotation-expensive symbolic language data. SymGen consists of an informative prompt to steer generation and an agreement-based verifier to improve data correctness. We conduct extensive experiments on six symbolic language tasks across various settings. Compared with the LLMs, we demonstrate the 1%-sized task model can achieve comparable or better performance, largely cutting inference and deployment costs. We also show that generated data with only a few human demonstrations can be as effective as over 10 times the amount of human-annotated data when training the task model, saving a considerable amount of annotation effort. SymGen sheds new light on data generation for complex tasks, and we release the code at https://github.com/HKUNLP/SymGen.
- North America > United States > Montana (0.05)
- North America > United States > Alabama (0.05)
- Asia > Vietnam (0.04)
- (20 more...)
101 NumPy Exercises for Data Analysis (Python) - Machine Learning Plus
The goal of the numpy exercises is to serve as a reference as well as to get you to apply numpy beyond the basics. The questions are of 4 levels of difficulty, with L1 being the easiest and L4 being the hardest. If you want a quick refresher on numpy, the numpy basics and the advanced numpy tutorials might be what you are looking for. Q. Import numpy as np and print the version number. You must import numpy as np for the rest of the code in this exercise to work.
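A minimal solution to this first exercise, assuming NumPy is installed:

```python
# Import NumPy under the conventional alias and print the installed version.
import numpy as np

print(np.__version__)
```

All later exercises assume the `np` alias from this import.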